1 AccMon: Automatically Detecting Memory-Related Bugs via Program Counter-Based
Invariants

Pin Zhou, Wei Liu, Long Fei, Shan Lu, Feng Qin, Yuanyuan Zhou, Samuel Midkiff, Josep
Torrellas

December 2004 Proceedings of the 37th annual IEEE/ACM International Symposium 
on Microarchitecture MICRO 37 

Publisher: IEEE Computer Society 

Full text available: pdf(249.11 KB) Additional Information: full citation, abstract

This paper makes two contributions to architectural support for software debugging. First, 
it proposes a novel statistics-based, on-the-fly bug detection method called PC-based
invariant detection. The idea is based on the observation that, in most programs, a given 
memory location is typically accessed by only a few instructions. Therefore, by capturing 
the invariant of the set of PCs that normally access a given variable, we can detect 
accesses by outlier instructions, which are often caused by ... 



2 Efficient and flexible architectural support for dynamic monitoring 

Yuanyuan Zhou, Pin Zhou, Feng Qin, Wei Liu, Josep Torrellas 

March 2005 ACM Transactions on Architecture and Code Optimization (TACO), Volume 2
Issue 1
Publisher: ACM Press

Full text available: pdf(524.21 KB) Additional Information: full citation, abstract, references, index terms

Recent impressive performance improvements in computer architecture have not led to
significant gains in ease of debugging. Software debugging often relies on inserting
run-time software checks. In many cases, however, it is hard to find the root cause of a
bug. Moreover, program execution typically slows down significantly, often by 10-100
times. To address this problem, this paper introduces the intelligent watcher (iWatcher), a
novel architectural scheme to monitor dynamic executio ...

Keywords: Architectural support, dynamic monitoring, software debugging, thread-level 
speculation (TLS) 



3 Cache Simulation Based on Runtime Instrumentation for OpenMP Applications
Jie Tao, Josef Weidendorfer 

April 2004 Proceedings of the 37th annual symposium on Simulation ANSS '04 
Publisher: IEEE Computer Society 
Full text available: pdf(150.92 KB) Additional Information: full citation, abstract, index terms

To enable optimizations in memory access behavior of high performance applications,
cache monitoring is a crucial process. Simulation of cache hardware is needed in order to
allow research for non-existing cache architectures, and on the other hand, to get more
insight into metrics not measured by hardware counters in existing processors. One focus of
EP-Cache, a project investigating efficient programming on cache architectures, is on
developing cache monitoring hardware to give precise information abou ...

4 Compiler-directed run-time monitoring of program data access 
Chen Ding, Yutao Zhong

June 2002 ACM SIGPLAN Notices, Proceedings of the 2002 workshop on Memory
system performance MSP '02, Volume 38 Issue 2 supplement
Publisher: ACM Press 

Full text available: pdf(1.40 MB) Additional Information: full citation, abstract, references, citings

Accurate run-time analysis has been expensive for complex programs, in part because
most methods perform on all data. Some applications require only partial reorganization.
An example of this is off-loading infrequently used data from a mobile device. Complete
monitoring is not necessary because not all accesses can reach the displaced data. To
support partial monitoring, this paper presents a framework that includes a
source-to-source C compiler and a run-time monitor. The compiler inserts ru ...




5 iWatcher: Efficient Architectural Support for Software Debugging 
Pin Zhou, Feng Qin, Wei Liu, Yuanyuan Zhou, Josep Torrellas

March 2004 ACM SIGARCH Computer Architecture News, Proceedings of the 31st
annual international symposium on Computer architecture ISCA '04,
Volume 32 Issue 2
Publisher: IEEE Computer Society, ACM Press

Full text available: pdf(314.11 KB) Additional Information: full citation, abstract, citings

Recent impressive performance improvements in computer architecture have not led to
significant gains in ease of debugging. Software debugging often relies on inserting
run-time software checks. In many cases, however, it is hard to find the root cause of a bug.
Moreover, program execution typically slows down significantly, often by 10-100 times. To
address this problem, this paper introduces the Intelligent Watcher (iWatcher), novel
architectural support to monitor dynamic execution with minimal overh ...




6 RaceTrack: efficient detection of data race conditions via adaptive tracking
Yuan Yu, Tom Rodeheffer, Wei Chen 

October 2005 ACM SIGOPS Operating Systems Review, Proceedings of the twentieth
ACM symposium on Operating systems principles SOSP '05, Volume 39 Issue 5
Publisher: ACM Press

Full text available: pdf(321.34 KB) Additional Information: full citation, abstract, references, index terms

Bugs due to data races in multithreaded programs often exhibit non-deterministic
symptoms and are notoriously difficult to find. This paper describes RaceTrack, a dynamic
race detection tool that tracks the actions of a program and reports a warning whenever a
suspicious pattern of activity has been observed. RaceTrack uses a novel hybrid detection
algorithm and employs an adaptive approach that automatically directs more effort to
areas that are more suspicious, thus providing more accurate war ...

Keywords: race detection, virtual machine instrumentation 
7 Architectural support for performance tuning: a case study on the SPARCcenter 2000
A. Singhal, A. J. Goldberg






April 1994 ACM SIGARCH Computer Architecture News, Proceedings of the 21st
annual international symposium on Computer architecture ISCA '94, Volume 22 Issue 2
Publisher: IEEE Computer Society Press, ACM Press

Full text available: pdf(1.37 MB) Additional Information: full citation, abstract, references, citings, index terms

Latency hiding techniques such as multilevel cache hierarchies yield high performance 
when applications map well onto hierarchy implementations, but performance can suffer 
drastically when they do not. Identifying and reducing mismatches between an application 
and the memory hierarchy is difficult without insight into the actual behavior of the 
hardware implementation. We advocate the use of hardware event counters, as a cheap, 
effective and practical way to tune applications for a given hardwar ... 

8 Identifying and Exploiting Spatial Regularity in Data Memory References
Tushar Mohan, Bronis R. de Supinski, Sally A. McKee, Frank Mueller, Andy Yoo, Martin Schulz
November 2003 Proceedings of the 2003 ACM/IEEE conference on Supercomputing
Publisher: IEEE Computer Society

Full text available: pdf(264.75 KB) Additional Information: full citation, abstract

The growing processor/memory performance gap causes the performance of many codes 
to be limited by memory accesses. If known to exist in an application, strided memory 
accesses forming streams can be targeted by optimizations such as prefetching, 
relocation, remapping, and vector loads. Undetected, they can be a significant source of 
memory stalls in loops. Existing stream-detection mechanisms either require special 
hardware, which may not gather statistics for subsequent analysis, or are limite ... 

9 Track 4: reconfigurable computing (part 2): Owl: next generation system monitoring
Martin Schulz, Brian S. White, Sally A. McKee, Hsien-Hsin S. Lee, Jurgen Jeitner
May 2005 Proceedings of the 2nd conference on Computing frontiers
Publisher: ACM Press

Full text available: pdf(430.90 KB) Additional Information: full citation, abstract, references, index terms

As microarchitectural and system complexity grows, comprehending system behavior 
becomes increasingly difficult, and often requires obtaining and sifting through voluminous 
event traces or coordinating results from multiple, non-localized sources. Owl is a 
proposed framework that overcomes limitations faced by traditional performance counters 
and monitoring facilities in dealing with such complexity by pervasively deploying 
programmable monitoring elements throughout a system. The design exploit ... 

Keywords: autonomous performance monitoring, performance analysis, reconfiguration 



10 SMP system interconnect instrumentation for performance analysis
Lisa Noordergraaf, Robert Zak 

November 2002 Proceedings of the 2002 ACM/IEEE conference on Supercomputing 

Publisher: IEEE Computer Society Press 

Full text available: pdf(397.28 KB) Additional Information: full citation, abstract, references, citings, index terms

The system interconnect is often the performance bottleneck in SMP computers. Although 
modern SMPs include event counters on processors and interconnects, these provide 
limited information about the interaction of processors vying for shared resources. 
Additionally, transaction sources and addresses are not readily available, making analysis 
of access patterns and data locality difficult. Enhanced system interconnect 
instrumentation is required to extract this information. This paper describes in ...
11 Software visualization for specific domains: Interactive locality optimization on NUMA
architectures

Tao Mu, Jie Tao, Martin Schulz, Sally A. McKee

June 2003 Proceedings of the 2003 ACM symposium on Software visualization
Publisher: ACM Press

Full text available: pdf(743.51 KB) Additional Information: full citation, abstract, references, index terms

Optimizing the performance of shared-memory NUMA programs remains something of a
black art, requiring that application writers possess deep understanding of their programs' 
behaviors. This difficulty represents one of the remaining hindrances to the widespread 
adoption and deployment of these cost-efficient and scalable shared-memory NUMA 
architectures. To address this problem, we have developed a performance monitoring 
infrastructure and a corresponding set of tools to aid in visualizing and un ... 

Keywords: NUMA Architectures, distributed systems, interactive locality optimizations, 
performance visualization 



12 Nonintrusive precision instrumentation of microcontroller software 
Ben L. Titzer, Jens Palsberg

June 2005 ACM SIGPLAN Notices, Proceedings of the 2005 ACM SIGPLAN/SIGBED
conference on Languages, compilers, and tools for embedded systems
LCTES '05, Volume 40 Issue 7
Publisher: ACM Press

Full text available: pdf(188.83 KB) Additional Information: full citation, abstract, references, citings, index terms

Debugging, testing, and profiling microcontroller programs are notoriously difficult. The 
lack of supporting software such as an operating system, a narrow interface to the 
hardware chip, and delicately timed sequences of code present significant challenges 
which can be exacerbated by the presence of additional debugging or profiling code. In 
this paper we present a solution to the precision instrumentation problem for 
microcontroller code that is based upon our open, flexible simulator framewor ...

Keywords: cycle-accurate, debugging, instruction-level simulation, instrumentation, 
monitoring, parallel simulation, profiling, sensor networks 



13 Using Hardware Counters to Automatically Improve Memory Performance 

Mustafa M. Tikir, Jeffrey K. Hollingsworth 

November 2004 Proceedings of the 2004 ACM/IEEE conference on Supercomputing 
Publisher: IEEE Computer Society 

Full text available: pdf(152.84 KB) Additional Information: full citation, abstract

In this paper, we introduce a profile-driven online page migration scheme and investigate 
its impact on the performance of multithreaded applications. We use lightweight, 
inexpensive plug-in hardware counters to profile the memory access behavior of an 
application, and then migrate pages to memory local to the most frequently accessing 
processor. Using the Dyninst runtime instrumentation combined with hardware counters, 
we were able to add page migration capabilities to the system without having ... 

14 SPiKE: engineering malware analysis tools using unobtrusive binary-instrumentation
Amit Vasudevan, Ramesh Yerraballi

January 2006 Proceedings of the 29th Australasian Computer Science Conference -
Volume 48 ACSC '06
Publisher: Australian Computer Society, Inc.

Full text available: pdf(832.66 KB) Additional Information: full citation, abstract, references, index terms

Malware, a generic term that encompasses viruses, trojans, spywares and other
intrusive code, is widespread today. Malware analysis is a multi-step process providing
insight into malware structure and functionality, facilitating the development of an
antidote. Behavior monitoring, an important step in the analysis process, is used to
observe malware interaction with respect to the system and is achieved by employing
dynamic coarse-grained binary-instrumentation on the target system. However, ...

Keywords: instrumentation, malware, security 



15 Informing memory operations: memory performance feedback mechanisms and their
applications

Mark Horowitz, Margaret Martonosi, Todd C. Mowry, Michael D. Smith

May 1998 ACM Transactions on Computer Systems (TOCS), Volume 16 Issue 2
Publisher: ACM Press

Full text available: pdf(344.74 KB) Additional Information: full citation, abstract, references, citings, index terms, review

Memory latency is an important bottleneck in system performance that cannot be
adequately solved by hardware alone. Several promising software techniques have been
shown to address this problem successfully in specific situations. However, the generality
of these software approaches has been limited because current architectures do not
provide a fine-grained, low-overhead mechanism for observing and reacting to memory
behavior directly. To fill this need, this article proposes a new class ...

Keywords: cache miss notification, memory latency, processor architecture 



16 Execution analysis of DSM applications: a distributed and scalable approach
Lionel Brunie, Laurent Lefevre, Olivier Reymann

January 1996 Proceedings of the SIGMETRICS symposium on Parallel and distributed
tools
Publisher: ACM Press

Full text available: pdf(1.06 MB) Additional Information: full citation, references, citings, index terms



Keywords: distributed shared memory, monitoring, performance evaluation, program 
visualization 



17 Cache performance analysis of traversals and random accesses
Richard E. Ladner, James D. Fix, Anthony LaMarca

January 1999 Proceedings of the tenth annual ACM-SIAM symposium on Discrete
algorithms
Publisher: Society for Industrial and Applied Mathematics

Full text available: pdf(1.07 MB) Additional Information: full citation, references, citings, index terms



18 A two-tier memory architecture for high-performance multiprocessor systems
T. M. Nguyen, V. P. Srini, A. M. Despain

June 1988 Proceedings of the 2nd International conference on Supercomputing
Publisher: ACM Press 

Performance of high-speed multiprocessor systems is limited by the available bandwidth to 
memory and the need to synchronize write sharable data. This paper presents a new 
memory system that separates synchronization related data from others. The memory 
system has two tiers: synchronization memory and high bandwidth (HB) memory. The 
synchronization memory consists of snooping caches connected to a bus and is used to 
store synchronization variables such as locks and semaphores. The H ... 

19 Shared-memory performance profiling
Zhichen Xu, James R. Larus, Barton P. Miller

June 1997 ACM SIGPLAN Notices, Proceedings of the sixth ACM SIGPLAN symposium
on Principles and practice of parallel programming PPOPP '97, Volume 32
Issue 7
Publisher: ACM Press 

Full text available: pdf(1.19 MB) Additional Information: full citation, abstract, references, citings, index terms

This paper describes a new approach to finding performance bottlenecks in shared- 
memory parallel programs and its embodiment in the Paradyn Parallel Performance Tools 
running with the Blizzard fine-grain distributed shared memory system. This approach 
exploits the underlying system's cache coherence protocol to detect data sharing patterns 
that indicate potential performance bottlenecks and presents performance measurements 
in a data-centric manner. As a demonstration, Paradyn helped us improve ...

20 Informing memory operations: providing memory performance feedback in modern
processors

Mark Horowitz, Margaret Martonosi, Todd C. Mowry, Michael D. Smith
May 1996 ACM SIGARCH Computer Architecture News, Proceedings of the 23rd
annual international symposium on Computer architecture ISCA '96, Volume
24 Issue 2
Publisher: ACM Press 

Full text available: pdf(1.55 MB) Additional Information: full citation, abstract, references, citings, index terms

Memory latency is an important bottleneck in system performance that cannot be 
adequately solved by hardware alone. Several promising software techniques have been 
shown to address this problem successfully in specific situations. However, the generality 
of these software approaches has been limited because current architectures do not 
provide a fine-grained, low-overhead mechanism for observing and reacting to memory 
behavior directly. To fill this need, we propose a new class of memory operati ... 